JCO Clinical Cancer Informatics

American Society of Clinical Oncology (ASCO)

Preprints posted in the last 7 days, ranked by how well they match JCO Clinical Cancer Informatics's content profile, based on 18 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.

1
Onca: An Open 9B Language Model for Pancreatic Cancer Clinical Tasks

Shim, K. B.

2026-04-24 oncology 10.64898/2026.04.16.26351055 medRxiv
Top 0.1%
18.3%

Pancreatic ductal adenocarcinoma (PDAC) remains one of the deadliest solid tumors and continues to face low treatment-trial participation, fragmented evidence workflows, and labor-intensive abstraction of unstructured clinical text. Existing oncology-focused language models show promise, but many depend on private institutional corpora, limiting reproducibility and practical reuse across centers. We present Onca, an open 9B dense model designed for four PDAC-relevant tasks: trial eligibility screening, case-specific clinical reasoning, structured pathology report extraction, and molecular variant evidence reasoning. Onca is fine-tuned from Qwopus3.5-9B-v3 with a single Unsloth BF16 LoRA adapter on 37,364 training rows drawn from openly available sources. The evaluation spans 11 panels and compares Onca against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the unmodified Qwopus base. Onca achieves the strongest overall results on Trial Screening (81.6 F1), Clinical Reasoning (14.1 composite), Pathology Extraction (30.5 field exact-match), PubMedQA Cancer (68.3 macro-F1), and PubMedQA (66.5 macro-F1). The strongest gains appear in tasks closest to routine oncology workflow, especially trial review and pathology structuring. These findings suggest that clinically targeted pancreatic-cancer language models can be built from open data with competitive performance while remaining practical to train on a single workstation-scale GPU setup.

2
A Context-Aware Target Engagement and Pharmacodynamic Biomarker Resource to Accelerate Drug Discovery and Development

Yang, Y.; Zhao, L.; Orouji, S.; Zhu, Y.; Johnson, R. L.; Maxwell, D. S.; Mica, I.; Russell, K. P.; Al-lazikani, B.

2026-04-22 bioinformatics 10.64898/2026.04.19.719411 medRxiv
Top 0.1%
12.2%

Confirming target engagement in tumor experimental models remains a major challenge in oncology drug development. Pharmacodynamic biomarkers can help address this, but few systematic resources link drug targets to candidate biomarkers. We developed TargetTrace, a comprehensive resource to identify and prioritize pharmacodynamic biomarkers across nine key target classes: transcription factors/cofactors, kinases, phosphatases, ubiquitin ligases, deubiquitinases, acetyltransferases, deacetylases, methyltransferases, and demethylases. Biomarker candidates were gathered from curated molecular interaction resources and refined using external annotations to improve accuracy. For enzyme targets with measurable substrate changes, we applied a two-agent large language model workflow, followed by manual review, to harmonize antibody information from antibody resources and ensure that the selected biomarkers are measurable with existing laboratory tests. From more than 92,000 input interactions and over 2,300 targets, we compiled 71,323 target-biomarker relationships involving 2,270 potential drug targets, encompassing both transcription factor/cofactor-target gene and enzyme-substrate interactions. Commercial antibodies were available for over 1,400 biomarkers, supporting laboratory validation. TargetTrace provides a structured, reusable basis for systematic identification and prioritization of pharmacodynamic biomarkers in oncology.

3
Methodological and Clinical Validation of TholdStormDX v0.0.1: An Advanced Stochastic Engine for the Optimization of Thresholds and Multimarker Panels Applied to Oncology

Reinosa, R.

2026-04-27 oncology 10.64898/2026.04.24.26351692 medRxiv
Top 0.1%
8.2%

Introduction: The translation of biomarkers into binary clinical decisions requires the determination of precise cut-off points. This study validates the TholdStormDX v0.0.1 tool, a mathematical engine that employs Dual Annealing, 2- and 4-parameter logistic fitting, and vectorized Monte Carlo simulations for panel optimization under Boolean OR logic. Methods: The tool was evaluated using datasets from four diagnostic domains (Pulmonary Nodules, Hepatocellular Carcinoma [HCC], Cervical Cancer, and Breast Cancer), along with a prognosis-oriented analytical context (Breast Cancer). Validation followed a strict workflow: characterization and selection of the best individual and combined thresholds in the Training (Train) and Validation (Val) sets, using the Test set in a completely independent manner, solely to assess the model's performance and generalizability. Results: The tool enabled precise derivation of cut-off points for both individual biomarkers and multivariable combinations. Evaluation on the Test set objectively demonstrated in which scenarios a single biomarker outperforms a complex panel, promoting clinical parsimony. For example, in Breast Cancer diagnosis, an individual predictor outperformed the optimized panel (Sensitivity: 0.953 / Specificity: 0.952 in Test); conversely, in Hepatocellular Carcinoma, the multivariable combination showed superior performance compared to the single marker (Sens: 0.707 / Spe: 0.718 in Test). Additionally, the self-auditing system effectively flagged metric degradation when noisy variables were included, preventing potential issues. Conclusion: TholdStormDX v0.0.1 proves to be a robust and transparent bioinformatics platform for deriving clinical thresholds. Its main contribution lies in mitigating local minima and promoting clinical parsimony, enabling researchers to objectively identify when a single biomarker is sufficient and when a panel provides real added value. 
Furthermore, it transforms the problem of biological noise into a safety feature: by systematically warning about algorithmic instability, it prevents overfitting and ensures the clinical viability of medical decisions. Availability: The software is free and distributed under the GNU GPLv3 license. TholdStormDX v0.0.1 is written in Python, and its source code is available at the following GitHub address: https://github.com/roberto117343/TholdStormDX.
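
As a rough illustration of what "panel optimization under Boolean OR logic" means, the sketch below classifies a sample as positive when any marker reaches its cut-off and then computes sensitivity and specificity. It is not taken from TholdStormDX: the function names, marker names, cut-offs, toy data, and the assumption that higher marker values indicate disease are all hypothetical.

```python
def panel_or_predict(rows, thresholds):
    """Boolean OR panel: a sample is positive if ANY marker is at or above its cut-off."""
    return [
        any(row[name] >= cutoff for name, cutoff in thresholds.items())
        for row in rows
    ]


def sensitivity_specificity(predictions, labels):
    """Return (sensitivity, specificity) for boolean predictions and labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y)
    tn = sum(1 for p, y in zip(predictions, labels) if not p and not y)
    fp = sum(1 for p, y in zip(predictions, labels) if p and not y)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: three diseased samples followed by three healthy ones.
rows = [
    {"marker_a": 5.1, "marker_b": 0.2},
    {"marker_a": 1.0, "marker_b": 0.9},
    {"marker_a": 1.2, "marker_b": 0.1},
    {"marker_a": 0.8, "marker_b": 0.3},
    {"marker_a": 4.5, "marker_b": 0.1},
    {"marker_a": 0.5, "marker_b": 0.2},
]
labels = [True, True, True, False, False, False]
thresholds = {"marker_a": 4.0, "marker_b": 0.8}  # assumed cut-offs, not fitted

predictions = panel_or_predict(rows, thresholds)
sensitivity, specificity = sensitivity_specificity(predictions, labels)
print(predictions, sensitivity, specificity)
```

An OR panel can only add positives relative to any single marker in it, so it tends to trade specificity for sensitivity; that trade-off is why a single marker can outperform a panel on held-out data.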

4
CohortContrast: An R Package for Enrichment-Based Identification of Clinically Relevant Concepts in OMOP CDM Data

Haug, M.; Ilves, N.; Umov, N.; Loorents, H.; Suvalov, H.; Tamm, S.; Oja, M.; Reisberg, S.; Vilo, J.; Kolde, R.

2026-04-23 health informatics 10.64898/2026.04.22.26351461 medRxiv
Top 0.1%
6.3%

Objective: To address the unresolved bottleneck of selecting cohort-relevant clinical concepts for treatment trajectory analysis in observational health data, we introduce CohortContrast, an OMOP-compatible R package for enrichment-based concept identification, temporal and semantic noise reduction, and concept aggregation, enabling cohort-level characterization and downstream trajectory analysis. Materials and Methods: We developed CohortContrast and applied it to OMOP-mapped observational data from the Estonian nationwide OPTIMA database, which includes all cases of lung, breast, and prostate cancer, focusing here on lung and prostate cancer cohorts. The workflow combines target-control statistical enrichment, temporal/global noise filtering, hierarchical concept aggregation and correlation-based merging, with optional patient clustering for downstream trajectory exploration. We validated the approach with a clinician-based plausibility assessment of extracted diagnosis-concept pairs and evaluated a large language model (LLM) as an auxiliary filtering step. Results: We analyzed 7,579 lung cancer and 11,547 prostate cancer patients. The workflow reduced concept dimensionality from 5,793 to 296 concepts (94.9%) in lung cancer and from 5,759 to 170 concepts (97.0%) in prostate cancer, and identified three exploratory patient subgroups in both cohorts. In a plausibility assessment of 466 diagnosis-concept pairs, validators rated 31.3% as directly linked and 57.5% as indirectly linked. Discussion: CohortContrast reduces manual concept curation by prioritizing and aggregating cohort-relevant concepts while preserving clinically interpretable treatment patterns in OMOP-based real-world data. Conclusion: CohortContrast enables scalable reduction of broad OMOP concept spaces into clinically interpretable, cohort-specific representations for exploratory trajectory analysis and real-world evidence research.

5
Generalizable Deep Learning Framework for Radiotherapy Dose Prediction Across Cancer Sites, Prescriptions and Treatment Modalities

Chang, H.-h.; Cardan, R.; Nedunoori, R.; Fiveash, J.; Popple, R.; Bodduluri, S.; Stanley, D. N.; Harms, J.; Cardenas, C.

2026-04-22 radiology and imaging 10.64898/2026.04.17.26350770 medRxiv
Top 0.1%
4.9%

Optimizing radiotherapy dose distributions remains a resource-intensive bottleneck. Existing AI-based dose prediction methods often have limited generalizability because they rely on small, heterogeneous datasets. We present nnDoseNetv2, an auto-configured, end-to-end framework for dose prediction across diverse disease sites (head and neck, prostate, breast, and lung), prescription levels (1.5-84 Gy), and treatment modalities (IMRT, VMAT, and 3D-CRT). By integrating machine-specific beam geometry with 3D structural information, the framework is designed to generalize across varied clinical scenarios. A single multi-site model was trained on 1,000 clinical plans. On sites seen during training, performance was comparable to specialized site-specific models. On unseen sites (liver and whole brain), the model outperformed site-specific models, with mean absolute errors of 2.46% and 6.97% of prescription, respectively. These results suggest that geometric awareness can bridge disparate anatomical domains while eliminating the need for site-specific model maintenance, providing a scalable and high-fidelity approach for personalized radiotherapy planning.

6
MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support

Van Oyen, C.; Mirza-Haq, N.

2026-04-21 health informatics 10.64898/2026.04.14.26350711 medRxiv
Top 0.2%
4.4%

MedSafe-Dx (v0) is a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support, built on a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Evaluation of eleven models revealed a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.

7
Consensus Through Diversity: A Comprehensive Benchmark of Multi-Omic Approaches for Precision Breast Oncology

Sionakidis, A.; Pinilla Alba, K.; Abraham, J.; Simidjievski, N.

2026-04-21 bioinformatics 10.64898/2026.04.17.719159 medRxiv
Top 0.2%
4.3%

Emerging multi-omic profiling has made it feasible to subtype disease using multiple molecular layers. However, inconsistent preprocessing, heterogeneous implementations, variable evaluation, and limited reproducibility often constrain method selection. Here, we systematically benchmark 22 publicly available unsupervised approaches for bulk data on the TCGA-BRCA cohort across five modalities (RNA-seq, miRNA, DNA methylation, copy numbers, single nucleotide polymorphisms) and validate findings in two independent datasets, enabling a multi-layered comparison of performance, heterogeneous data support and interpretability. Most approaches fuse multi-omic data to produce a two-cluster solution largely aligned with ER status, with higher-resolution approaches further refining these into four coherent subclasses (angiogenic luminal, oxidative-phosphorylation/HER2-low luminal, immune-inflamed basal-like, and hyper-proliferative basal-like). Our benchmarking results indicate that methods based on similarity networks can efficiently produce stable, reliable partitions. Matrix factorisation and Bayesian factorisation algorithms produce rich latent representations, allowing quantification of feature and modality contributions, albeit at higher computational cost. Consensus clustering can be used on a case-by-case basis to refine partitions into more robust and generalisable findings. We aggregate our insights into a decision workflow that aligns with study goals, data characteristics, and computational resources, enabling optimal analytic strategies. This comprehensive assessment provides a practical roadmap for investigators seeking to extract reproducible, biologically meaningful subtypes from complex multi-omic datasets. We highlight the different technical and practical benefits and trade-offs that shape the selection and development of multi-omic approaches applied in precision oncology.

8
CT-Based Deep Foundation Model for Predicting Immune Checkpoint Inhibitor-Induced Pneumonitis Risk in Lung Cancer

Muneer, A.; Showkatian, E.; Kitsel, Y.; Saad, M. B.; Sujit, S. J.; Soto, F.; Shroff, G. S.; Faiz, S. A.; Ghanbar, M. I.; Ismail, S. M.; Vokes, N. I.; Cascone, T.; Le, X.; Zhang, J.; Byers, L. A.; Jaffray, D.; Chang, J. Y.; Liao, Z.; Naing, A.; Gibbons, D. L.; Vaporciyan, A. A.; Heymach, J. V.; Suresh, K. S.; Altan, M.; Sheshadri, A.; Wu, J.

2026-04-23 oncology 10.64898/2026.04.21.26351428 medRxiv
Top 0.2%
4.0%

Background: Immune checkpoint inhibitors (ICIs) have revolutionized cancer therapy but can cause serious immune-related adverse events (irAEs), with pneumonitis (ICI-P) being among the most severe. Early identification of high-risk patients before ICI initiation is critical for closer monitoring, timely intervention, and improved outcomes. Purpose: To develop and validate a deep learning foundation model to predict ICI-P from baseline CT scans in patients with lung cancer. Methods: We designed the Checkpoint-Inhibitor Pneumonitis Hazard EstimatoR (CIPHER), a deep learning foundation model that combines contrastive learning with a transformer-based masked autoencoder to predict ICI-P from baseline CT scans in patients with lung cancer. Using self-supervised learning, CIPHER was pre-trained on 590,284 CT slices from 2,500 non-small cell lung cancer (NSCLC) patients to capture heterogeneous lung parenchymal patterns. After pre-training, the model was fine-tuned on an internal NSCLC cohort for ICI-P risk prediction, using images from 254 patients for model development and 93 patients for internal validation. We compared CIPHER with classical radiomic models and further evaluated it on an external NSCLC cohort of 116 patients. Results: In the internal immunotherapy cohort, CIPHER consistently distinguished patients at elevated risk of ICI-P from those without the event, with AUCs ranging from 0.77 to 0.85. In head-to-head benchmarking, CIPHER achieved an AUC of 0.83, outperforming the radiomic models. In the external validation cohort, CIPHER maintained strong performance (AUC = 0.83; balanced accuracy = 81.7%), exceeding the radiomic models (DeLong p = 0.0318) and demonstrating higher specificity without sacrificing sensitivity. By contrast, the radiomic model showed high sensitivity (85.0%) but markedly lower specificity (45.8%). 
Confusion matrix analysis confirmed the robust classification performance of CIPHER, correctly identifying 80 of 96 non-ICI-P cases and 16 of 20 ICI-P cases. Conclusions: We developed and externally validated CIPHER for predicting future risk of ICI-P from pre-treatment CT scans. With prospective validation, CIPHER may be incorporated into routine patient management to improve outcomes.
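
The external-validation counts quoted above (80 of 96 non-ICI-P cases and 16 of 20 ICI-P cases correctly identified) are enough to recompute the reported balanced accuracy. The short check below uses only those four numbers; the variable names are illustrative, and balanced accuracy is taken in its usual sense as the mean of sensitivity and specificity.

```python
# Counts from the external validation cohort (116 patients) as stated in the abstract.
true_positives = 16                      # ICI-P cases correctly identified
false_negatives = 20 - true_positives    # ICI-P cases missed
true_negatives = 80                      # non-ICI-P cases correctly identified
false_positives = 96 - true_negatives    # non-ICI-P cases flagged as ICI-P

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"sensitivity       = {sensitivity:.1%}")
print(f"specificity       = {specificity:.1%}")
print(f"balanced accuracy = {balanced_accuracy:.1%}")
```

The result agrees with the balanced accuracy of 81.7% reported for the external cohort.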

9
Mechanistic learning to predict and understand minimal residual disease

Marzban, S.; Robertson-Tessi, M.; West, J.

2026-04-21 cancer biology 10.64898/2026.04.16.718968 medRxiv
Top 0.2%
4.0%

Mechanistic modeling has long been used as a tool to describe the dynamics of biological systems, especially cancer in response to treatment. Its key advantage lies in the interpretability of relationships between input parameters and outcomes of interest. In contrast, machine learning techniques offer strong prediction performance, especially for high dimensional datasets that are common in oncology. Here, we employ a Mechanistic Learning framework that combines the advantages of both approaches by training machine learning models on mechanistic parameters inferred from clinical patient data. The mechanistic model (a Markov chain model) contains sixteen parameters that describe the rate of cell fate transitions that occur in patients with B-cell precursor acute lymphoblastic leukemia. The machine learning model (a ridge logistic regression model) is trained on these parameters to predict two clinically-relevant features: BCR::ABL1 fusion gene status (positive or negative) and minimal residual disease status (positive or negative) post-induction chemotherapy. Model training is done in an iterative fashion to assess which (and how many) parameters are critical to maintain high predictive performance. Using machine learning models trained on the clinical flow-cytometry data, we find that the stem-like cell state alone is the most predictive feature for both BCR::ABL1-positive and MRD-positive disease, with combination scores (defined as the average of accuracy, balanced accuracy, and area under the curve) of 0.80 and 0.67, respectively. By comparison, mechanistic learning achieves comparable or improved combination scores for BCR::ABL1-positive and MRD-positive disease, with scores of 0.81 and 0.71, respectively, using only de-differentiation for BCR::ABL1 and primitive-state persistence together with differentiation-directed exit for MRD. 
Thus, the mechanistic-learning approach not only preserves predictive performance, but also provides a biological hypothesis for why stemness is predictive of these clinically relevant outcomes.

10
Decision Curve Analysis for Evaluating Machine Learning Models for Next-Day Transfer Out of ICU

Pozo, M.; Pape, A.; Locke, B.; Pettine, W. W.

2026-04-21 health informatics 10.64898/2026.04.19.26351213 medRxiv
Top 0.2%
3.7%

Timely identification of intensive care unit (ICU) patients likely to exit the unit can support anticipatory workflows such as chart review, eligibility screening, and patient outreach prior to transfer. Most ICU discharge prediction studies report discrimination and calibration, but these metrics do not quantify the decision consequences of acting on predictions. Using adult ICU admissions from MIMIC-IV, we represented each ICU stay as a sequence of daily clinical summaries and trained logistic regression, random forest, and XGBoost models to predict next-day ICU transfer. Models achieved ROC AUC of 0.80-0.84 with differing calibration. We evaluated decision utility using decision curve analysis (DCA), where positive predictions trigger proactive review. Across thresholds, model-guided strategies outperformed review-all, review-none, and a simple clinical rule. To translate net benefit into implementable operations, we modeled a clinical trial recruitment workflow with an 8-hour daily time constraint, incorporating chart review and consent effort. At a feasible operating threshold (0.23), the model flagged ~23 charts/day and yielded ~1.23 enrollments/day under conservative eligibility and consent assumptions. These results demonstrate that DCA provides a transparent framework for determining when ICU transfer predictions are worth using and how thresholds should be selected to align with real-world workflow constraints. Data and Code Availability: This research has been conducted using data from MIMIC-IV. Researchers can request access via PhysioNet. Implementation code is available upon request.
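
For readers unfamiliar with decision curve analysis, the sketch below computes net benefit using the standard definition, NB = TP/N - (FP/N) * pt / (1 - pt), where pt is the threshold probability. It compares a model-guided strategy with review-all and review-none at the 0.23 threshold mentioned above. The labels and predicted probabilities are invented toy values, and the function name is hypothetical; this is not the authors' implementation.

```python
def net_benefit(labels, flagged, threshold):
    """Net benefit of acting on `flagged` cases at threshold probability `threshold`."""
    n = len(labels)
    true_pos = sum(1 for y, f in zip(labels, flagged) if f and y)
    false_pos = sum(1 for y, f in zip(labels, flagged) if f and not y)
    # False positives are weighted by the odds of the threshold probability.
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


# Toy data (hypothetical): 1 = transferred out of ICU the next day.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
probabilities = [0.9, 0.1, 0.6, 0.3, 0.05, 0.7, 0.4, 0.15, 0.5, 0.8]
threshold = 0.23

model_flags = [p >= threshold for p in probabilities]
nb_model = net_benefit(labels, model_flags, threshold)
nb_review_all = net_benefit(labels, [True] * len(labels), threshold)
nb_review_none = net_benefit(labels, [False] * len(labels), threshold)

print(f"model: {nb_model:.3f}  review-all: {nb_review_all:.3f}  review-none: {nb_review_none:.3f}")
```

A decision curve repeats this calculation over a range of thresholds. A model is worth using at a given threshold only where its net benefit exceeds both default strategies.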

11
Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes

Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.

2026-04-24 health informatics 10.64898/2026.04.23.26351098 medRxiv
Top 0.3%
3.5%

Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology that we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: We saw varying accuracy across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). By contrast, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures, and generation failures where retrieved codes are not used. 
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the design of future workflows that avoid them.

12
MIMIC-IV-Phenotype-Atlas (MIPA): A Publicly Available Dataset for EHR Phenotyping

Yamga, E.; Goudrar, R.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350888 medRxiv
Top 0.3%
2.3%

Introduction: Secondary use of electronic health records (EHRs) often requires transforming raw clinical information into research-grade data. A central step in this process is EHR phenotyping - the identification of patient cohorts defined by specific medical conditions. Although numerous approaches exist, from ICD-based heuristics to supervised learning and large language models (LLMs), the field lacks standardized benchmark datasets, limiting reproducibility and hindering fair comparison across methods. Methods: We developed the MIMIC-IV Phenotype Atlas (MIPA) dataset, an adaptation of MIMIC-IV that provides expert-annotated discharge summaries across 16 phenotypes of varying prevalence and complexity. Two independent clinicians reviewed and labeled the discharge summaries, resolving disagreements by consensus. In parallel, we implemented a processing pipeline that extracts multimodal EHR features and generates training, validation, and testing datasets for supervised phenotyping. To illustrate MIPA's utility, we benchmarked four phenotyping methods on the task: ICD-based classifiers, keyword-driven Term Frequency-Inverse Document Frequency (TF-IDF) classifiers, supervised machine learning (ML) models, and LLMs. Results: The final MIPA corpus consists of 1,388 expert-annotated discharge summaries. Annotation reliability was high (mean document-level kappa = 0.805, mean label-level kappa = 0.771), with 91% of disagreements resolved through consensus review. MIPA provides high-quality phenotype labels paired with structured EHR features and predefined train/validation/test splits for each phenotype. In the benchmarking case study, LLMs achieved the highest F1 scores in 13 of 16 phenotypes, particularly for conditions requiring contextual interpretation of clinical narrative, while supervised ML offered moderate improvements over rule-based baselines. 
Conclusion: MIPA is the first publicly available benchmark dataset dedicated to EHR phenotyping, combining expert-curated annotations, broad phenotype coverage, and a reproducible processing pipeline. By enabling standardized comparison across ICD-based heuristics, ML models, and LLMs, MIPA provides a durable reference resource to advance methodological development in automated phenotyping.

13
A Systematic Exploration of LLM Behavior for EHR phenotyping

Yamga, E.; Murphy, S.; Despres, P.

2026-04-24 health informatics 10.64898/2026.04.16.26350890 medRxiv
Top 0.4%
2.1%

Background: Electronic health record (EHR) phenotyping underpins observational research, cohort discovery, and clinical trial screening. Large language models (LLMs) offer new capabilities for extracting phenotypes from unstructured text, but their performance depends on pipeline design choices, including prompting, text segmentation, and aggregation. No systematic framework has previously examined how these parameters shape accuracy and reproducibility. Methods: We evaluated LLM-based phenotyping pipelines using 1,388 discharge summaries across 16 clinical phenotypes. A full factorial experiment with LLaMA-3B, 8B, and 70B systematically varied three pipeline components: prompting (zero-shot, few-shot, chain-of-thought, extract-then-phenotype), chunking (none, naive, document-based), and aggregation (any-positive, two-vote, majority), yielding 24 configurations per model. To compare intrinsic model capabilities, biomedical domain-adapted, commercial frontier (LLaMA-405B, GPT-4o, Gemini Flash 2.0), and reasoning-optimized models (DeepSeek-R1) were evaluated under a fixed configuration. Performance was assessed using precision, recall, and macro-F1; secondary analyses examined prediction consistency (Shannon entropy), self-confidence calibration, and the development of a taxonomy of recurrent model errors. Results: Factorial ANOVAs showed that chunking and aggregation were the dominant drivers of performance, whereas the prompting strategy contributed minimally. Configuration effects were stable across model sizes, with no significant Model × Parameter interactions. Phenotype difficulty varied substantially (macro-F1 = 0.40-0.90), yet the highest-performing configuration (whole-document inference without aggregation) was consistent across phenotypes, as confirmed by mixed-effects modeling. In cross-model comparisons, DeepSeek-R1 achieved the highest macro-F1 (0.89), while LLaMA-70B matched GPT-4o and LLaMA-405B at substantially lower cost. 
Prediction entropy was low overall and driven primarily by phenotype difficulty rather than prompting or temperature. Self-confidence calibration was only moderately informative: high-confidence predictions were more accurate, but larger models exhibited systematic overconfidence. Conclusions: LLM performance in EHR phenotyping is governed primarily by input structure and model capacity, not prompt engineering. Simple, document-level inference yields robust performance across diverse phenotypes, providing practical design guidance for LLM-based cohort identification while underscoring the continued need for human oversight for challenging phenotypes.
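
The three aggregation rules named in the abstract (any-positive, two-vote, majority) combine chunk-level predictions into one document-level label. The abstract does not define them precisely, so the sketch below encodes one plausible reading: at least one positive chunk, at least two positive chunks, and more than half of the chunks positive. The function name and these thresholds are assumptions.

```python
def aggregate_chunk_votes(chunk_votes, rule):
    """Combine per-chunk boolean predictions into one document-level prediction.

    The rule definitions are assumed interpretations, not taken from the preprint.
    """
    positives = sum(1 for vote in chunk_votes if vote)
    if rule == "any-positive":
        return positives >= 1
    if rule == "two-vote":
        return positives >= 2
    if rule == "majority":
        return positives > len(chunk_votes) / 2
    raise ValueError(f"unknown aggregation rule: {rule}")


# Hypothetical discharge summary split into five chunks, two of which
# the model labelled positive for the phenotype.
votes = [True, True, False, False, False]
results = {
    rule: aggregate_chunk_votes(votes, rule)
    for rule in ("any-positive", "two-vote", "majority")
}
print(results)
```

Under this reading the rules are ordered from most to least sensitive for documents with three or more chunks, which is one way aggregation choice could shift the precision/recall balance independently of the prompt.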

14
Interpretability as stability under perturbation reveals systematic inconsistencies in feature attribution

Piorkowska, N. J.; Olejnik, A.; Ostromecki, A.; Kuliczkowski, W.; Mysiak, A.; Bil-Lula, I.

2026-04-22 health informatics 10.64898/2026.04.20.26351354 medRxiv
Top 0.4%
1.7%

Interpreting machine learning models typically relies on feature attribution methods that quantify the contribution of individual variables to model predictions. However, it remains unclear whether attribution magnitude reflects the true functional importance of features for model performance. Here, we present a unified interpretability framework integrating permutation-based attribution, feature ablation, and stability under perturbation across multiple feature spaces. Using nested cross-validation and permutation-based null diagnostics, we systematically evaluate the relationship between attribution magnitude and functional dependence in clinical and biomarker-based prediction models. Attribution magnitude is frequently misaligned with functional importance, with weak to strong negative correlations observed across feature spaces (Spearman ρ ranging from -0.374 to -0.917). Features with high attribution often have limited impact on model performance when removed, whereas features with low attribution can be essential for maintaining predictive accuracy. These discrepancies define distinct classes of interpretability failure, including attribution excess and latent dependence. Interpretability further depends on feature space composition, and stable, functionally relevant features are not necessarily those with the highest attribution scores. By integrating attribution, functional impact, and stability into a composite Feature Reliability Score, we identify features that remain informative across perturbations and analytical contexts. These findings indicate that interpretability does not arise from attribution magnitude alone but is better characterized by stability under perturbation. This framework provides a basis for more robust model interpretation and highlights limitations of attribution-centric approaches in high-dimensional and correlated data settings.

15
MedSAM2-CXR: A Box-Latent Framework for Chest X-ray Classification and Report Generation

Hakata, Y.; Oikawa, M.; Fujisawa, S.

2026-04-22 health informatics 10.64898/2026.04.20.26351338 medRxiv
Top 0.5%
1.5%

Who is affected: In Japan, approximately 100 million chest radiographs (CXRs) are acquired annually, while only about 7,000 board-certified diagnostic radiologists practice nationwide (Japan Radiological Society workforce statistics; OECD Health Statistics, most recent available year). This implies an average workload exceeding 10,000 imaging studies per radiologist per year if all CXRs were attributed to board-certified diagnostic radiologists (an upper-bound estimate, because in practice many CXRs are primarily read by non-radiologist physicians). In settings such as night shifts, weekends, remote islands, and regional care networks, non-radiologist physicians frequently act as primary readers. Despite strong demand for AI assistance, existing systems are typically limited by one of three shortcomings -- poor cross-institutional generalization, limited interpretability, or inability to generate draft reports -- and consequently see limited clinical deployment. What we built: We propose a Box-Latent Trinity that embeds each image as a hyperrectangle parameterized by a center c and a radius r, rather than as a single point in a latent space. We further introduce BL-TTA (Box-Latent Test-Time Augmentation), which approximately closes the train-inference gap (exact in the N → ∞ limit; N = 8 suffices in practice) by averaging predictions over samples drawn from within the latent box at inference time. Both components are implemented on top of the frozen MedSAM2 medical imaging foundation model. A single box representation simultaneously supports three functions: (A) theoretically grounded source selection, (B) device-invariant augmentation, and (C) case-based retrieval-augmented generation (RAG). Each prediction is accompanied by retrieved similar prior cases, a calibrated confidence estimate, and clinical-guideline references. 
How well it performs: On the Open-i CXR corpus (2,954 image-report pairs) under a patient-level 80/10/10 split and 5-seed reproducibility, the full system B5 achieves macro area under the receiver-operating-characteristic curve (macro-AUROC) 0.639 (best-seed test; 5-seed mean 0.626, Table 2; absolute +0.015 over the strongest same-backbone baseline, Merlin-style 0.624), elementwise accuracy 0.753 (absolute +0.072 over Merlin-style 0.681 -- equivalent to approximately 7 fewer label-level errors per 100 (label, image) predictions across 14 finding labels, not per 100 images), and report label-F1 0.435 (absolute +0.086, relative +25% over the strongest same-backbone report-generation baseline, Bootstrapping-style 0.349). Under simulated pixel-space device-shift intensities up to twice the training distribution, AUROC degrades by only 0.014. Brier score (macro) is 0.061; Cohen's κ between two independent rule-based label extractors is 0.702 (substantial agreement); the box radius yields an out-of-distribution (OOD) detection AUROC of 0.595; and the framework provides four structural explainable-AI (XAI) outputs -- retrieved similar cases, confidence tier, per-axis uncertainty, and visual saliency -- which we jointly quantify in a single CXR study, a combination that, to our knowledge, has not been reported previously. Path to deployment: Because the complete experiment can be reproduced in under two hours on a consumer-grade GPU (NVIDIA RTX 4060, 8 GB VRAM), the framework can run on compute resources already available at typical healthcare institutions. 
The approach thus supports the practical delivery of evidence-grounded diagnostic support to night shifts, remote-island care, and secondary readings in health checkups -- settings in which a board-certified radiologist is not locally available. One-sentence summary: Reproducible end-to-end in under two hours on a single consumer-grade GPU, the proposed framework outperforms the strongest same-backbone medical-AI baselines on three principal metrics, maintains accuracy under simulated device shifts, and automatically drafts evidence-grounded radiology reports, offering a reproducible and compute-efficient direction toward reducing the reading burden of Japanese radiologists, subject to external validation.
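
The box-latent averaging step described in this abstract can be illustrated with a minimal sketch. It assumes uniform sampling inside the hyperrectangle [c - r, c + r] and a hypothetical linear-plus-sigmoid classifier head; the abstract does not specify the sampling distribution or the head, and the names `sample_in_box`, `linear_head`, and `box_latent_tta` are illustrative, not the authors' code.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_in_box(center, radius, rng):
    """Draw one point uniformly from the hyperrectangle [c - r, c + r] (assumed distribution)."""
    return [c + rng.uniform(-r, r) for c, r in zip(center, radius)]


def linear_head(z, weights, bias):
    """Hypothetical classifier head: one sigmoid probability per finding label."""
    return [sigmoid(sum(w * x for w, x in zip(row, z)) + b)
            for row, b in zip(weights, bias)]


def box_latent_tta(center, radius, weights, bias, n_samples=8, seed=0):
    """Average head predictions over n_samples latent points drawn inside the box."""
    rng = random.Random(seed)
    totals = [0.0] * len(bias)
    for _ in range(n_samples):
        z = sample_in_box(center, radius, rng)
        for i, p in enumerate(linear_head(z, weights, bias)):
            totals[i] += p
    return [t / n_samples for t in totals]


if __name__ == "__main__":
    # Made-up 3-dimensional latent box and a 2-label head, for illustration only.
    center = [0.2, -0.1, 0.5]
    radius = [0.05, 0.10, 0.02]
    weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
    bias = [0.0, -0.2]
    print(box_latent_tta(center, radius, weights, bias, n_samples=8))
```

With a zero radius the box collapses to a single point and the average reduces to an ordinary point-embedding prediction, which is one way to sanity-check such a routine.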

16
A catalogue of missense and nonsense mutation abundances for the U.S. cancer patient population

Arun, A.; Liarakos, D.; Mendiratta, G.; McFall, T.; Hargreaves, D. C.; Wahl, G. M.; Hu, J.; Stites, E. C.

2026-04-22 oncology 10.64898/2026.04.20.26351248 medRxiv
Top 0.6%
1.2%

Widespread genomic sequencing efforts have characterized the molecular foundations of the different cancers. By combining these genomic data in a manner proportional to the population-level abundances of these different cancers, we estimate the overall abundances of each observed missense and nonsense mutation within the U.S. cancer patient population. We find BRAF V600E (5.2%) is the most common mutation in the cancer patient population, TP53 R175H (1.5%) is the most common tumor suppressor mutation, and APC R876X (0.4%) is the most common nonsense mutation. These values differ substantially and significantly from what would be found in a typical pan-cancer analysis, where different cancer types are included out of proportion to population-level incidence. We present the full ordered lists of population-level abundances for specific missense and nonsense mutations, and we demonstrate the value of these data by further analyzing high-priority genes (e.g., TP53, KRAS, BRAF) and pathways (e.g., RTK/RAS, PI3K, and WNT/β-catenin). Overall, this information is a resource that should benefit the basic science, translational, and clinical cancer research communities.
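
The weighting idea in this abstract can be sketched as follows: a mutation's overall abundance is the sum, over cancer types, of each type's share of patients times the mutation's frequency within that type. The sketch below is illustrative only. The function name `weighted_abundance`, the cancer-type labels, and every number are made up, and the authors' actual estimation procedure may differ.

```python
def weighted_abundance(type_weights, mutation_freq):
    """Combine per-cancer-type mutation frequencies using the given type weights.

    type_weights: cancer type -> patient count (population incidence or cohort size)
    mutation_freq: cancer type -> {mutation: fraction of that type's patients carrying it}
    """
    total = sum(type_weights.values())
    abundance = {}
    for cancer, freqs in mutation_freq.items():
        share = type_weights[cancer] / total
        for mutation, f in freqs.items():
            abundance[mutation] = abundance.get(mutation, 0.0) + share * f
    return abundance


if __name__ == "__main__":
    # Hypothetical inputs: type_A is common in the population but under-sequenced.
    freq = {"type_A": {"mut_X": 0.10}, "type_B": {"mut_X": 0.50}}
    incidence = {"type_A": 900, "type_B": 100}      # population-level patient counts
    cohort_sizes = {"type_A": 100, "type_B": 400}   # sequenced samples per type

    print(round(weighted_abundance(incidence, freq)["mut_X"], 2))     # 0.14
    print(round(weighted_abundance(cohort_sizes, freq)["mut_X"], 2))  # 0.42
```

In this toy case, pooling samples by cohort size triples the apparent abundance of the mutation relative to incidence weighting, which is the kind of distortion the abstract attributes to a typical pan-cancer analysis.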

17
Practical Management of Adverse Events Associated with Bispecific Antibodies for the Treatment of Multiple Myeloma: A Qualitative Interview Study

Graham, T. R.; White, M. G.; Blue, B.; Hartley-Brown, M.; Hunter, B. D.; Huynh, C.; Joseph, N.; Keruakous, A.; Pan, D.; Rudolph, P.; Sawhney, R.; Suvannasankha, A.

2026-04-27 oncology 10.64898/2026.04.24.26350878 medRxiv
Top 0.7%
0.9%

PURPOSE: Bispecific antibodies (BsAbs) represent a major advancement in the management of relapsed/refractory multiple myeloma (RRMM), offering high response rates even in heavily pretreated patients. However, their use presents operational, safety, and supportive care complexities that require coordinated care teams and evolving infrastructure. This manuscript summarizes best practice recommendations for adverse event (AE) management, outpatient operational models, referral pathways, and emerging strategies to optimize long-term tolerability. METHODS: Medlive, A PlatformQ Health Brand, conducted qualitative interviews of academic and community-based clinicians. Discussions focused on BsAb implementation, patient selection and counseling, and AE management. Experts provided recommendations on team-based protocols, transitions of care, and inpatient versus outpatient considerations. RESULTS: Ten hematologists/oncologists (academic n=4; community n=6) described practice patterns, barriers, and perspectives on BsAb use. BsAbs were consistently regarded as highly effective across multiple lines of therapy, particularly for patients without alternatives. Cytokine release syndrome (CRS) was the most common acute toxicity, generally low grade and managed effectively with early tocilizumab, including prophylactic use in outpatient settings. Immune effector cell-associated neurotoxicity syndrome (ICANS) was rare, mild, and best mitigated through early recognition and caregiver support. Infections, largely from BCMA-associated hypogammaglobulinemia, frequently interrupted therapy, necessitating antiviral prophylaxis, Pneumocystis jirovecii pneumonia (PJP) prophylaxis, and intravenous immunoglobulin (IVIG). Outpatient step-up dosing is expanding, supported by prophylactic strategies and academic-community collaboration. Timely referral was emphasized to preserve eligibility. 
Major outpatient challenges included sequencing, infrastructure readiness, and standardized caregiver and staff education. CONCLUSION: Effective community implementation of BsAbs requires multidisciplinary coordination, standardized AE protocols, infection prevention, and infrastructure to support monitoring, referrals, and equitable access. These measures are critical to ensure safe, sustainable integration of bispecific therapies and to optimize patient outcomes.

18
Tumor Biology and Patterns of Recurrence in High-Grade Glioma: Implications for Radiation Target Delineation

Barve, R.; Gowda, D.; Illiayaraja, K. J.

2026-04-25 oncology 10.64898/2026.04.23.26351633 medRxiv
Top 0.7%
0.9%

Purpose: Recurrence in high-grade glioma (HGG) predominantly occurs within the high-dose radiation field, raising the question of whether treatment failure reflects limitations in radiation target delineation or is driven by intrinsic tumor biology. This study evaluated recurrence patterns following standard chemoradiotherapy and their treatment implications. Material and Methods: This retrospective single-center study included 41 patients with histologically confirmed HGG treated with surgery followed by radiotherapy with concurrent and adjuvant temozolomide (TMZ). Patients were followed through August 2018; those with recurrence were included in the analysis. Recurrence patterns were classified based on their spatial relationship to the 60 Gy isodose line as central, infield, marginal, or distant. Survival outcomes were estimated using the Kaplan-Meier method and compared using the log-rank test. Results: The most common pattern of recurrence was central (15 patients, 36.6%), followed by infield (11, 26.8%), distant (6, 14.6%), marginal (5, 12.2%), and multicentric (4, 9.8%). Central and infield recurrences (local failures) accounted for 26 patients (63%). Median overall survival (OS) was 27 months, and median progression-free survival (PFS) was 12 months. Survival differed significantly by recurrence pattern (log-rank p = 0.018), with marginal recurrence associated with more favorable outcomes. Conclusion: The predominance of central and infield recurrences within the high-dose region suggests that treatment failure in HGG is not solely explained by inadequate target delineation and may also be driven, in part, by intrinsic tumor biology, including radioresistant subpopulations and tumor heterogeneity. Future strategies may benefit from incorporating biologically guided approaches alongside optimization of radiation treatment parameters.
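
As a quick arithmetic check, the recurrence-pattern shares can be recomputed from the patient counts given in this abstract (41 patients in total). The helper name `pattern_shares` is illustrative; values computed this way may differ in the last digit from reported figures, depending on how they were rounded.

```python
def pattern_shares(counts):
    """Return each pattern's share of all recurrences, in percent."""
    total = sum(counts.values())
    return {pattern: 100.0 * n / total for pattern, n in counts.items()}


if __name__ == "__main__":
    # Counts as stated in the abstract.
    counts = {"central": 15, "infield": 11, "distant": 6, "marginal": 5, "multicentric": 4}
    shares = pattern_shares(counts)
    for pattern, pct in shares.items():
        print(f"{pattern}: {pct:.1f}%")
    # Local failure is defined in the abstract as central plus infield recurrence.
    local = shares["central"] + shares["infield"]
    print(f"local failure: {local:.1f}%")
```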

19
Multi-Hospital Electronic Health Record Foundation Models Without Data Sharing: A Comparison of Federated Learning and Inference-Time Ensembling

Elemento, O.

2026-04-27 health informatics 10.64898/2026.04.24.26351702 medRxiv
Top 0.8%
0.8%

Background. Foundation models for electronic health records (EHRs) perform strongly on clinical prediction, but every published model has been trained within a single health system. No multi-institutional EHR foundation model currently exists, largely because privacy regulations and governance barriers block data pooling across hospitals. Two strategies could build such models without pooling: federated learning (exchanges model weights) and inference-time ensembling (exchanges only predictions at query time). Whether either is viable for autoregressive EHR foundation models, and whether individual hospitals benefit from participating, is not established. Methods. We trained a generative pretrained transformer (GPT) style EHR foundation model on 100,163 Medical Information Mart for Intensive Care (MIMIC-IV) patients, partitioned into five heterogeneously distributed (non-IID) sites by Dirichlet allocation over International Classification of Diseases (ICD) chapters. We compared centralized training, federated averaging, and inference-time ensembling, and each hospital's solo model against the ensemble including it. Models were evaluated on 15,012 held-out patients using per-condition area under the receiver operating characteristic curve (AUROC) for five acute conditions and micro-averaged area under the precision-recall curve (AUPRC) across 2,590 diagnoses. Results. Centralized training achieved per-condition AUROC 0.75-0.85 and overall AUPRC 0.376. Federated averaging recovered 85% of centralized AUPRC (0.321) and 98-100% of per-condition AUROC. Inference-time ensembling, requiring no training-time exchange, recovered 77% of AUPRC (0.291) and 97-99% of per-condition AUROC. An estimated 87% of participating hospitals received a better model from the ensemble than from training alone; only hospitals with ~40% of the network's patients matched the ensemble on their own. FedProx collapsed to the marginal baseline. Conclusions. 
Multi-institutional EHR foundation models can be built without pooling patient data. Inference-time ensembling benefits most participating hospitals and imposes the lightest governance burden; federated learning recovers more performance but requires weight sharing. These findings offer a practical path toward collaborative clinical AI.
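
The two collaboration strategies compared in this abstract differ in what crosses institutional boundaries: federated averaging combines model weights, while inference-time ensembling combines only predictions. The sketch below shows both aggregation steps on toy vectors. It assumes size-weighted weight averaging and an unweighted mean of site probabilities; the abstract does not state the exact ensembling rule, and all function names are illustrative.

```python
def federated_average(site_weights, site_sizes):
    """FedAvg aggregation step: size-weighted mean of per-site parameter vectors."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
            for j in range(dim)]


def ensemble_predictions(site_probs):
    """Inference-time ensembling: unweighted mean of per-site probabilities (assumed rule)."""
    n_sites = len(site_probs)
    return [sum(col) / n_sites for col in zip(*site_probs)]


def recovered_fraction(metric, centralized):
    """Share of the centralized model's metric retained by an alternative strategy."""
    return metric / centralized


if __name__ == "__main__":
    # Toy parameter vectors from two sites holding 1 and 3 patients' worth of data.
    print(federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
    # Toy per-site probabilities for two query patients.
    print([round(p, 2) for p in ensemble_predictions([[0.2, 0.8], [0.4, 0.6]])])
    # AUPRC values reported in the abstract, relative to centralized training (0.376).
    print(round(100 * recovered_fraction(0.321, 0.376)))  # 85
    print(round(100 * recovered_fraction(0.291, 0.376)))  # 77
```

The last two lines reproduce the 85% and 77% AUPRC recovery figures from the reported values.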

20
Histology-Derived Signatures Predict Recurrence Risk and Chemotherapy Benefit in Randomized Trials of Early Breast Cancer

Howard, F. M.; Li, A.; Kochanny, S.; Sullivan, M.; Flores, E. M.; Dolezal, J.; Khramtsova, G.; Hassan, S.; Medenwald, R.; Saha, P.; Fan, C.; McCart, L.; Watson, M.; Teras, L. R.; Bodelon, C.; Patel, A. V.; Symmans, W. F.; Partridge, A.; Carey, L.; Olopade, O. I.; Stover, D.; Perou, C.; Yao, K.; Pearson, A. T.; Huo, D.

2026-04-24 oncology 10.64898/2026.04.23.26351499 medRxiv
Top 0.8%
0.8%

Purpose: To test whether histology-derived gene-expression signatures from routine hematoxylin and eosin slides are prognostic for recurrence and predictive of chemotherapy benefit in early breast cancer. Methods: We conducted a multi-cohort study including CALGB 9344 (anthracycline +/- paclitaxel), CALGB 9741 (standard vs dose-dense chemotherapy), a pooled Chicago real-world cohort, and the American Cancer Society (ACS) Cancer Prevention Studies-II and -3. Whole-slide images were processed with a previously described pipeline to generate 61 histology-derived signatures per patient. The primary endpoint was distant recurrence-free interval (DRFI), except in ACS, where breast cancer-specific survival was used. Secondary endpoints included distant recurrence-free survival (DRFS) and overall survival. The most prognostic signature in CALGB 9344, selected by Harrell's C-index, was evaluated in additional cohorts. Signature-treatment interaction was assessed by likelihood-ratio tests. Multivariable Cox models incorporating age, tumor size, nodal status, estrogen/progesterone receptor status, and signature were fit in CALGB 9344 to improve risk stratification. Results: A total of 7,170 patients were included across four cohorts. The top histology-derived signature in CALGB 9344 showed strong prognostic performance for 5-year DRFI (C-index 0.63) and performed well across validation cohorts (C-index 0.60, 0.70, and 0.62 in CALGB 9741, Chicago, and ACS, respectively). The strongest predictive signal for treatment benefit was observed for DRFS. High-risk cases identified by the signature demonstrated greater benefit from taxane in CALGB 9344 (adjusted hazard ratio [aHR] 0.76 for DRFS, 95% CI 0.66-0.88; interaction p=0.028) and from dose-dense chemotherapy in CALGB 9741 (aHR 0.69, 95% CI 0.56-0.85; interaction p=0.039), and showed differential chemotherapy benefit in the Chicago cohort (aHR 0.84, 95% CI 0.59-1.21; interaction p=0.009). 
Combined clinical-histology models improved risk stratification and identified low-risk groups with a 2%-10% risk of distant recurrence or breast cancer death. Conclusion: Histology-derived signatures from H&E images are broadly prognostic and, unlike clinical factors, may predict chemotherapy benefit.
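
Harrell's C-index, used in this abstract to rank signatures, is the fraction of comparable patient pairs in which the patient with the earlier event also has the higher predicted risk. The following is a minimal sketch for right-censored data. It ignores ties in event time and counts tied risk scores as half-concordant, which are simplifying assumptions; the data are made up and the function is not the authors' pipeline.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    A pair (i, j) is comparable when subject i had an observed event and
    times[i] < times[j]; it is concordant when i also has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Hypothetical follow-up times (months), event flags (1 = event), and risk scores.
    times = [2, 4, 6, 8]
    events = [1, 1, 0, 1]
    risk = [0.9, 0.7, 0.5, 0.1]
    print(harrell_c_index(times, events, risk))  # 1.0
```

A value of 0.5 corresponds to uninformative ranking and 1.0 to perfect ranking, which gives context for the reported values of 0.60 to 0.70.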